Poster: aibek Date: Nov 19, 2013 5:18am
Forum: texts Subject: Re: dead links / downloading pdf files

By the way, the following is how I have been searching for the files.

The documentation:
https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server

Please do CDX queries responsibly. (These are highly resource intensive.)

All the files archived from www.thdl.org/community/pdfs/:
http://web.archive.org/cdx/search/cdx?url=http://www.thdl.org/community/pdfs/&matchType=prefix&limit=1000&output=json
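A query URL like the one above can also be assembled programmatically. The following is only a minimal sketch: the function name `build_cdx_url` and its keyword arguments are made up for illustration, and it only builds the string without sending any request.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, match_type="prefix", limit=1000, filters=()):
    """Build a CDX query URL (hypothetical helper, not part of any IA library).

    `filters` is a sequence of 'field:regex' strings, one per filter= parameter.
    """
    params = [
        ("url", url),
        ("matchType", match_type),
        ("limit", str(limit)),
        ("output", "json"),
    ]
    params += [("filter", f) for f in filters]
    # Keep ':' and '/' readable; other reserved characters are percent-encoded.
    return CDX_ENDPOINT + "?" + urlencode(params, safe=":/")


print(build_cdx_url("http://www.thdl.org/community/pdfs/"))
```

Keeping the limit as an explicit argument makes it harder to send an unbounded query by accident.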

You can construct the Wayback Machine link for each result from its timestamp and original fields with the following formula. (Both timestamp and original without the double quotes, of course.)
web.archive.org/web/timestamp/original
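The link construction can be sketched in a few lines. This assumes the output=json response is a list of rows whose first row holds the field names, and it uses the /web/ path segment that Wayback Machine URLs normally carry. The function name `wayback_links` and the sample row are invented for illustration and are not real query results.

```python
def wayback_links(rows):
    """Turn CDX JSON rows (header row first) into Wayback Machine URLs."""
    if not rows:
        return []
    header, *records = rows
    ts = header.index("timestamp")
    orig = header.index("original")
    return [f"http://web.archive.org/web/{r[ts]}/{r[orig]}" for r in records]


# Hypothetical sample in the assumed response shape.
sample = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["org,thdl)/community/pdfs/a.pdf", "20060101000000",
     "http://www.thdl.org/community/pdfs/a.pdf", "application/pdf",
     "200", "EXAMPLEDIGEST", "1234567"],
]
for link in wayback_links(sample):
    print(link)
```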


All the files archived from old.thdl.org/community/pdfs/:
http://web.archive.org/cdx/search/cdx?url=http://old.thdl.org/community/pdfs/&matchType=prefix&limit=1000&output=json

The field ‘original’ contains the full URL, so I use a regex to check whether a file called ‘schoolrep.pdf’ was archived at any location on thdl.org:
http://web.archive.org/cdx/search/cdx?url=thdl.org&matchType=host&output=json&limit=50&filter=original:.*schoolrep.pdf
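The same kind of filtering can be tried locally on rows that have already been downloaded, which avoids sending another query. This is a sketch under an assumption: that a filter must match the whole field, which the leading `.*` in the query suggests. `filter_rows` and the sample rows are hypothetical. Note that the unescaped dot in `schoolrep.pdf` matches any character, so `\.` is the stricter choice.

```python
import re


def filter_rows(rows, field, pattern):
    """Keep the records whose `field` fully matches `pattern` (header row first)."""
    header, *records = rows
    idx = header.index(field)
    rx = re.compile(pattern)
    return [r for r in records if rx.fullmatch(r[idx])]


# Hypothetical rows, reduced to two fields for brevity.
rows = [
    ["timestamp", "original"],
    ["20060101000000", "http://www.thdl.org/a/schoolrep.pdf"],
    ["20060102000000", "http://www.thdl.org/a/schoolrepXpdf"],
    ["20060103000000", "http://www.thdl.org/a/other.pdf"],
]

loose = filter_rows(rows, "original", r".*schoolrep.pdf")    # '.' matches any character
strict = filter_rows(rows, "original", r".*schoolrep\.pdf")  # literal dot only
print(len(loose), len(strict))
```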

Using regex, I search for all PDF files with a size between 1 MB and 10 MB (the seven dots match any seven-character length, i.e. 1,000,000 to 9,999,999):
http://web.archive.org/cdx/search/cdx?url=thdl.org&matchType=host&output=json&limit=50&filter=length:.......&filter=mimetype:application/pdf
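To see why seven dots work as a size filter, the pattern can be compared with a plain numeric check. This is only a local illustration, again assuming the filter has to match the whole ‘length’ value. The function name is made up, and the test values are arbitrary rather than taken from any real result.

```python
import re

SEVEN_CHARS = re.compile(r".......")  # same pattern as in the query above


def is_roughly_1_to_10_mb(length_field):
    """True if the 'length' string has exactly seven characters."""
    return SEVEN_CHARS.fullmatch(length_field) is not None


for value in ["999999", "1000000", "9999999", "10000000"]:
    by_regex = is_roughly_1_to_10_mb(value)
    by_number = 1_000_000 <= int(value) <= 9_999_999
    print(value, by_regex, by_number)
```

For purely numeric values the two checks agree, so the regex is a compact stand-in for a range test that the filter parameter cannot express directly.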

(Note that in the above, limit is set to 50, so it returns only the first 50 entries. No point in wasting IA resources by asking it to run pointless errands!)

Note also that the ‘length’ field does not exactly correspond to the file size. I don’t know what it is, but I know that it is always approximately equal to the file size that IA has.

This post was modified by aibek on 2013-11-19 13:18:55